In our Part 1, we analyzed a hotel booking dataset from the INN
Hotels Group to understand the factors influencing booking
cancellations.
hotel_data <- read.csv("../../Dataset/INNHotelsGroup_min.csv")
hotel_datay <- read.csv("../../Dataset/INNHotelsGroup.csv")
str(hotel_datay)
## 'data.frame': 36275 obs. of 19 variables:
## $ Booking_ID : chr "INN00001" "INN00002" "INN00003" "INN00004" ...
## $ no_of_adults : int 2 2 1 2 2 2 2 2 3 2 ...
## $ no_of_children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_weekend_nights : int 1 2 2 0 1 0 1 1 0 0 ...
## $ no_of_week_nights : int 2 3 1 2 1 2 3 3 4 5 ...
## $ type_of_meal_plan : chr "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
## $ required_car_parking_space : int 0 0 0 0 0 0 0 0 0 0 ...
## $ room_type_reserved : chr "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
## $ lead_time : int 224 5 1 211 48 346 34 83 121 44 ...
## $ arrival_year : int 2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
## $ arrival_month : int 10 11 2 5 4 9 10 12 7 10 ...
## $ arrival_date : int 2 6 28 20 11 13 15 26 6 18 ...
## $ market_segment_type : chr "Offline" "Online" "Online" "Online" ...
## $ repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_price_per_room : num 65 106.7 60 100 94.5 ...
## $ no_of_special_requests : int 0 1 0 0 0 1 1 1 1 3 ...
## $ booking_status : chr "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...
The dataset contains hotel bookings collected from 2017-2018 and is
characterized by a total of 19 columns, comprising 5 categorical and 14
numerical variables.
summary(hotel_data)
## no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## Min. :0.00 Min. : 0.00 Min. :0.00 Min. : 0.0
## 1st Qu.:2.00 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 1.0
## Median :2.00 Median : 0.00 Median :1.00 Median : 2.0
## Mean :1.84 Mean : 0.11 Mean :0.81 Mean : 2.2
## 3rd Qu.:2.00 3rd Qu.: 0.00 3rd Qu.:2.00 3rd Qu.: 3.0
## Max. :4.00 Max. :10.00 Max. :7.00 Max. :17.0
## type_of_meal_plan required_car_parking_space room_type_reserved lead_time
## Length:36275 Min. :0.000 Length:36275 Min. : 0
## Class :character 1st Qu.:0.000 Class :character 1st Qu.: 17
## Mode :character Median :0.000 Mode :character Median : 57
## Mean :0.031 Mean : 85
## 3rd Qu.:0.000 3rd Qu.:126
## Max. :1.000 Max. :443
## arrival_year arrival_month market_segment_type repeated_guest
## Min. :2017 Min. : 1.00 Length:36275 Min. :0.000
## 1st Qu.:2018 1st Qu.: 5.00 Class :character 1st Qu.:0.000
## Median :2018 Median : 8.00 Mode :character Median :0.000
## Mean :2018 Mean : 7.42 Mean :0.026
## 3rd Qu.:2018 3rd Qu.:10.00 3rd Qu.:0.000
## Max. :2018 Max. :12.00 Max. :1.000
## no_of_previous_cancellations no_of_special_requests booking_status
## Min. : 0.00 Min. :0.00 Length:36275
## 1st Qu.: 0.00 1st Qu.:0.00 Class :character
## Median : 0.00 Median :0.00 Mode :character
## Mean : 0.02 Mean :0.62
## 3rd Qu.: 0.00 3rd Qu.:1.00
## Max. :13.00 Max. :5.00
## avg_price_per_room
## Min. : 0
## 1st Qu.: 80
## Median : 99
## Mean :103
## 3rd Qu.:120
## Max. :540
Summary:
The average price per room in the dataset is 103 euros, with a median of
99 euros, but prices can reach up to 540 euros, indicating high-priced
outliers. Some entries even show an average price of zero, possibly
reflecting promotional deals. Guests typically stay for two weekday
nights and one weekend night, with the average number of weekday nights
being 2.2 and weekend nights 0.81. Stays can extend to as many as 17
weekday nights. Most bookings involve two adults, and many guests do not
bring children. Lead times vary significantly, with an average of 85
days, a median of 57, and some bookings made up to 443 days in advance,
suggesting a right-skewed distribution. The data also shows sparse
previous cancellations, with an average of just 0.02, and a maximum of
58. Bookings are spread over 2017 and 2018, peaking in August.
Additionally, most guests do not make special requests, as the median is
zero, although some make up to five requests per booking.
print(ggplot(hotel_data, aes(x = booking_status)) +
geom_bar(fill = "lightblue") +
labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))
Summary
The dataset contains 36,275 bookings. Of the total bookings, 24,390
bookings were not canceled (67.2%), while 11,885 bookings were canceled
(32.8%), reflecting a significant cancellation rate that offers rich
insights into factors influencing booking decisions.
print(ggplot(hotel_data, aes(x = factor(no_of_special_requests))) +
geom_bar(fill = "lightblue") +
labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
theme_minimal())
Summary
Most guests make no special requests, with a sharp decline as the number
increases. A significant portion makes one request, while two or more
requests are increasingly rare.
contingency_table <- table(hotel_data$market_segment_type, hotel_data$booking_status)
plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")
print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Booking Status by Market Segment",
x = "Market Segment",
y = "Count",
fill = "Booking Status") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
Summary
Online market segment have the highest cancellation count, with a large
proportion of canceled bookings of 8475. Offline segment has a
cancellation count of 3253 For other market segment, the cancellation
rates are too less.
ggplot(hotel_data, aes(x = lead_time, fill = booking_status)) +
geom_histogram(binwidth = 10, position = "dodge") +
labs(
title = "Lead Time vs Booking Status",
x = "Lead Time (Days)",
y = "Number of Bookings",
fill = "Booking Status"
) +
theme_minimal()
Summary
Here we can see that the booking with short lead times are less
cancelled and as the lead time increases there are more booking
cancellations. Also, we can see more cancellation happening between lead
time form 100-200.
ggplot(hotel_data, aes(x = booking_status, y = avg_price_per_room, fill = booking_status)) +
geom_boxplot() +
labs(
title = "Average Room Price vs Booking Status",
x = "Booking Status",
y = "Average Room Price"
) +
theme_minimal() +
theme(legend.position = "none")
Summary
The booking with cancelled status has higher median average room price
value compared to non-cancelled booking status.This might mean that the
booking with her average room price have more chances to get
cancelled.
1. Special Requests and Cancellations:
Bookings with special requests had a significantly lower cancellation
rate (20.2%) compared to those without (43.2%). Chi-square test
revealed a statistically significant association between special
requests and booking status.
2. Previous Cancellation History:
Surprisingly, guests with previous cancellations had a much lower
current cancellation rate (4.73%) compared to those without
(33.03%).
This suggests that past cancellation history might not be a
straightforward predictor of future booking behavior.
3. Factors Influencing Cancellations:
3.1 Lead Time:
3.2 Room Price:
4. Seasonal Variations:
The correlation between factors and cancellations varied across
seasons:
4.1 Fall:
Strongest lead time correlation (0.54).
4.2 Summer:
Special requests most negatively correlated with cancellations
(-0.30).
4.3 Spring:
Unique pattern with special requests having a strong negative
correlation (-0.36).
4.4 Winter:
Slightly different dynamics with weaker correlations.
hotel_data$booking_status_binary <- ifelse(hotel_data$booking_status == "Canceled", 1, 0)
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
data_filtered <- remove_zero_variance(select_if(hotel_data, is.numeric))
cor_data <- cor(data_filtered, use = "complete.obs")
corrplot(cor_data, method = "color", addCoef.col = "black",
title = "Correlation Matrix Recap", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
Lead time strongly predicts cancellations, with longer lead times
increasing likelihood (0.44). More special requests (-0.25) and repeated
guest status (-0.11) reduce cancellations, reflecting customer
commitment and loyalty. Price per room has a weak positive correlation
(0.14) with cancellations. Factors like travel party size and stay
length show minimal impact on cancellation likelihood.
1. Lead time emerged as the most critical factor in
predicting booking cancellations.
2. Special requests significantly reduce the likelihood
of cancellations.
3. Booking behavior varies considerably across
different seasons.
4. Higher-priced rooms show a slight tendency towards
more cancellations.
In this phase, we aim to build a predictive model for hotel booking cancellations and explore key factors driving cancellation behavior. The analysis will address the following critical questions:
Can we predict the likelihood of booking cancellation based on the
lead time?
Can we predict peak booking times and high-demand periods?
What is the relationship between room price and the likelihood of a
booking being canceled?
How do seasonal trends impact cancellation rates and booking
patterns?
Which factors most strongly influence booking cancellation
decisions?
To achieve these objectives, we will analyze customer attributes such as room price, lead time, and seasonal trends. Using logistic and Random forest models, we aim to predict the likelihood of booking cancellations and identify actionable insights. The results will help hotels effectively manage cancellation risks, optimize revenue strategies, and better anticipate high-demand periods.
set.seed(123)
train_index <- createDataPartition(hotel_data$booking_status_binary, p = 0.7, list = FALSE)
train_data <- hotel_data[train_index, ]
test_data <- hotel_data[-train_index, ]
logistic_model <- glm(
booking_status_binary ~ lead_time,
data = train_data,
family = binomial()
)
summary(logistic_model)
##
## Call:
## glm(formula = booking_status_binary ~ lead_time, family = binomial(),
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82038 0.02328 -78.2 <2e-16 ***
## lead_time 0.01180 0.00019 62.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32083 on 25392 degrees of freedom
## Residual deviance: 27091 on 25391 degrees of freedom
## AIC: 27095
##
## Number of Fisher Scoring iterations: 4
test_predictions <- predict(logistic_model, newdata = test_data, type = "response")
predicted_classes <- ifelse(test_predictions > 0.5, 1, 0)
confusion_matrix <- table(Actual = test_data$booking_status_binary,
Predicted = predicted_classes)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
## Predicted
## Actual 0 1
## 0 6720 570
## 1 2102 1490
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- confusion_matrix[2,2] / sum(confusion_matrix[,2])
recall <- confusion_matrix[2,2] / sum(confusion_matrix[2,])
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("\nModel Performance Metrics:\n")
##
## Model Performance Metrics:
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.754
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.723
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.415
cat("F1 Score:", round(f1_score, 4), "\n")
## F1 Score: 0.527
1. Accuracy: 0.754, meaning the model correctly
classifies 75.4% of the bookings.
2. Precision: 0.723, indicating that 72.3% of the
predictions of canceled bookings are correct.
3. Recall: 0.415, meaning the model correctly
identifies 41.5% of the actual canceled bookings.
4. F1-Score: 0.527, which is the harmonic mean of
precision and recall, providing a balanced evaluation of the model’s
performance.
roc_curve <- roc(test_data$booking_status_binary, test_predictions)
plot(roc_curve, main = "ROC Curve for Booking Cancellation Prediction")
The plot indicates that as the specificity (the ability to correctly
identify non-canceled bookings) increases, the sensitivity (the ability
to correctly identify canceled bookings) also increases. This suggests
that the model has good discriminative power in predicting booking
cancellations.
auc_value <- auc(roc_curve)
cat("Area Under the ROC Curve (AUC):", round(auc_value, 4), "\n")
## Area Under the ROC Curve (AUC): 0.75
Area Under the ROC Curve (AUC): 0.75, suggesting the model has good
discriminative power in predicting booking cancellations.
hotel_data$booking_status_binary <- as.numeric(as.character(hotel_data$booking_status_binary))
ggplot(hotel_data, aes(x = lead_time, y = booking_status_binary)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE) +
labs(
title = "Probability of Booking Cancellation vs Lead Time",
x = "Lead Time (Days)",
y = "Probability of Cancellation"
) +
theme_minimal()
The graph shows the relationship between the lead time (in days) and
the probability of booking cancellation. As the lead time increases, the
probability of booking cancellation rises in a non-linear fashion, with
the curve becoming more steep at higher lead times.
prepare_booking_data <- function(hotel_data) {
booking_data <- hotel_data %>%
mutate(
high_demand = as.factor(ifelse(
avg_price_per_room > quantile(avg_price_per_room, 0.75),
1, 0
)),
booking_month = as.factor(arrival_month),
is_weekend = no_of_weekend_nights > 0,
total_nights = no_of_weekend_nights + no_of_week_nights,
is_repeated_guest = repeated_guest > 0
)
return(booking_data)
}
prepared_data <- prepare_booking_data(hotel_data)
set.seed(123)
train_index <- createDataPartition(prepared_data$high_demand, p = 0.7, list = FALSE)
train_data <- prepared_data[train_index, ]
test_data <- prepared_data[-train_index, ]
# Random Forest Model
rf_model <- randomForest(
high_demand ~ lead_time +
total_nights +
booking_month +
market_segment_type +
is_repeated_guest,
data = train_data,
ntree = 500,
importance = TRUE
)
evaluate_model <- function(actual, predicted, model_name) {
conf_matrix <- confusionMatrix(as.factor(predicted), as.factor(actual))
results <- data.frame(
Model = model_name,
Accuracy = conf_matrix$overall['Accuracy'],
Precision = conf_matrix$byClass['Precision'],
Recall = conf_matrix$byClass['Recall'],
F1 = conf_matrix$byClass['F1']
)
return(results)
}
rf_pred <- predict(rf_model, newdata = test_data, type = "prob")[,2]
rf_class <- ifelse(rf_pred > 0.5, 1, 0)
rf_results <- evaluate_model(test_data$high_demand, rf_class, "Random Forest")
print(rf_results)
## Model Accuracy Precision Recall F1
## Accuracy Random Forest 0.81 0.847 0.912 0.878
rf_importance <- data.frame(
Feature = rownames(importance(rf_model)),
Importance = importance(rf_model)[,1]
) %>% arrange(desc(Importance))
print(head(rf_importance, 10))
## Feature Importance
## booking_month booking_month 100.9
## lead_time lead_time 92.9
## market_segment_type market_segment_type 85.6
## total_nights total_nights 50.4
## is_repeated_guest is_repeated_guest 12.9
ggplot(head(rf_importance, 10), aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(
title = "Top 10 Features Predicting High-Demand Periods",
x = "Features",
y = "Importance"
) +
theme_minimal()
predict_high_demand_periods <- function(model, test_data) {
pred_probs <- predict(model, newdata = test_data, type = "prob")[,2]
high_demand_periods <- test_data[pred_probs > 0.7, ]
print("High-Demand Periods Prediction:")
print(paste("Total High-Demand Periods:", sum(pred_probs > 0.7)))
print("Sample of High-Demand Periods:")
print(head(high_demand_periods))
return(high_demand_periods)
}
high_demand_periods <- predict_high_demand_periods(rf_model, test_data)
## [1] "High-Demand Periods Prediction:"
## [1] "Total High-Demand Periods: 1462"
## [1] "Sample of High-Demand Periods:"
## no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## 62 1 0 0 3
## 110 2 0 2 1
## 181 2 0 0 3
## 202 2 0 0 2
## 223 2 0 1 5
## 232 2 1 1 3
## type_of_meal_plan required_car_parking_space room_type_reserved lead_time
## 62 Meal Plan 1 0 Room_Type 4 19
## 110 Meal Plan 1 0 Room_Type 1 32
## 181 Meal Plan 1 0 Room_Type 1 11
## 202 Meal Plan 1 0 Room_Type 1 80
## 223 Meal Plan 1 0 Room_Type 1 56
## 232 Meal Plan 1 0 Room_Type 1 77
## arrival_year arrival_month market_segment_type repeated_guest
## 62 2018 5 Online 0
## 110 2017 9 Online 0
## 181 2017 9 Online 0
## 202 2017 7 Online 0
## 223 2018 9 Online 0
## 232 2018 5 Online 0
## no_of_previous_cancellations no_of_special_requests booking_status
## 62 0 2 Not_Canceled
## 110 0 3 Not_Canceled
## 181 0 1 Not_Canceled
## 202 0 1 Canceled
## 223 0 0 Not_Canceled
## 232 0 1 Not_Canceled
## avg_price_per_room booking_status_binary high_demand booking_month
## 62 120.1 0 1 5
## 110 94.5 0 0 9
## 181 80.8 0 0 9
## 202 76.5 1 0 7
## 223 119.0 0 0 9
## 232 135.2 0 1 5
## is_weekend total_nights is_repeated_guest
## 62 FALSE 3 FALSE
## 110 TRUE 3 FALSE
## 181 FALSE 3 FALSE
## 202 FALSE 2 FALSE
## 223 TRUE 6 FALSE
## 232 TRUE 4 FALSE
# Fit a logistic regression model
lrmodel1 <- glm(booking_status_binary ~ avg_price_per_room, data = hotel_data, family = binomial)
summary(lrmodel1)
##
## Call:
## glm(formula = booking_status_binary ~ avg_price_per_room, family = binomial,
## data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.632519 0.036523 -44.7 <2e-16 ***
## avg_price_per_room 0.008695 0.000326 26.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45887 on 36274 degrees of freedom
## Residual deviance: 45153 on 36273 degrees of freedom
## AIC: 45157
##
## Number of Fisher Scoring iterations: 4
hotel_data$predicted_prob <- predict(lrmodel1, type = "response")
hotel_data$predicted_class <- ifelse(hotel_data$predicted_prob > 0.5, 1, 0)
# Model evaluation
conf_matrix <- table(hotel_data$booking_status_binary, hotel_data$predicted_class)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.67"
roc_curve <- roc(hotel_data$booking_status_binary, hotel_data$predicted_prob)
plot(roc_curve)
auc_value <- auc(roc_curve)
print(paste("AUC:", round(auc_value, 2)))
## [1] "AUC: 0.6"
# Scatter plot to examine the relationship between price and cancellation
plot(hotel_data$avg_price_per_room, hotel_data$booking_status_binary,
main = "Scatter plot of Avg Price per Room vs Booking Status",
xlab = "Average Price per Room", ylab = "Booking Status")
# Boxplot for average price per room vs booking cancellation status
boxplot(avg_price_per_room ~ booking_status_binary, data = hotel_data,
main = "Boxplot of Avg Price per Room vs Booking Status",
xlab = "Booking Status", ylab = "Average Price per Room")
hotel_data$season <- case_when(
hotel_data$arrival_month %in% c(3, 4, 5) ~ "Spring",
hotel_data$arrival_month %in% c(6, 7, 8) ~ "Summer",
hotel_data$arrival_month %in% c(9, 10, 11) ~ "Fall",
hotel_data$arrival_month %in% c(12, 1, 2) ~ "Winter",
TRUE ~ "Unknown"
)
# Create subsets for each season
spring_data <- subset(hotel_data, season == "Spring")
summer_data <- subset(hotel_data, season == "Summer")
fall_data <- subset(hotel_data, season == "Fall")
winter_data <- subset(hotel_data, season == "Winter")
# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
# Remove zero variance columns for each season
spring_data_filtered <- remove_zero_variance(select_if(spring_data, is.numeric))
summer_data_filtered <- remove_zero_variance(select_if(summer_data, is.numeric))
fall_data_filtered <- remove_zero_variance(select_if(fall_data, is.numeric))
winter_data_filtered <- remove_zero_variance(select_if(winter_data, is.numeric))
# Calculate and visualize correlation for Spring
cor_spring <- cor(spring_data_filtered, use = "complete.obs")
corrplot(cor_spring, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Spring", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Summer
cor_summer <- cor(summer_data_filtered, use = "complete.obs")
corrplot(cor_summer, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Summer", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Fall
cor_fall <- cor(fall_data_filtered, use = "complete.obs")
corrplot(cor_fall, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Fall", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Winter
cor_winter <- cor(winter_data_filtered, use = "complete.obs")
corrplot(cor_winter, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Winter", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
### Logistic Regression: Impact of Seasonality on Booking Status
# Fit logistic regression model
logit_model <- glm(booking_status_binary ~ season, family = binomial, data = hotel_data)
# Summarize the logistic regression model
summary(logit_model)
##
## Call:
## glm(formula = booking_status_binary ~ season, family = binomial,
## data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6965 0.0187 -37.28 <2e-16 ***
## seasonSpring 0.0492 0.0304 1.62 0.11
## seasonSummer 0.3406 0.0276 12.32 <2e-16 ***
## seasonWinter -1.0447 0.0415 -25.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45887 on 36274 degrees of freedom
## Residual deviance: 44616 on 36271 degrees of freedom
## AIC: 44624
##
## Number of Fisher Scoring iterations: 4
# Calculate odds ratios for the season coefficients
odds_ratios <- exp(coef(logit_model))
print(odds_ratios)
## (Intercept) seasonSpring seasonSummer seasonWinter
## 0.498 1.050 1.406 0.352
# Predict probabilities using the logistic regression model
hotel_data$predicted_probs <- predict(logit_model, type = "response")
# Classify predictions using a 0.5 cutoff
hotel_data$predicted_class <- ifelse(hotel_data$predicted_probs >= 0.5, 1, 0)
# Confusion matrix
confusion_matrix <- table(Predicted = hotel_data$predicted_class, Actual = hotel_data$booking_status_binary)
print(confusion_matrix)
## Actual
## Predicted 0 1
## 0 24390 11885
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.672
# Fit a decision tree with adjusted parameters
tree_model <- rpart(
booking_status_binary ~ season + lead_time + no_of_special_requests,
data = hotel_data,
method = "class",
control = rpart.control(cp = 0.001, minsplit = 10, maxdepth = 5) # Lower cp and minsplit to allow more splits
)
# Plot the updated decision tree
rpart.plot(tree_model, type = 3, extra = 102, fallen.leaves = TRUE,
main = "Decision Tree: Seasonal Trends and Cancellations")
hotel_data$type_of_meal_plan <- as.factor(hotel_data$type_of_meal_plan)
hotel_data$room_type_reserved <- as.factor(hotel_data$room_type_reserved)
hotel_data$market_segment_type <- as.factor(hotel_data$market_segment_type)
hotel_data$arrival_month <- as.factor(hotel_data$arrival_month)
rf_model1 <-randomForest(booking_status_binary ~ no_of_adults + no_of_children + no_of_weekend_nights +
no_of_week_nights + type_of_meal_plan + required_car_parking_space +
room_type_reserved + lead_time + arrival_year + arrival_month +
market_segment_type + repeated_guest +
no_of_previous_cancellations + avg_price_per_room + no_of_special_requests,
data = hotel_data, importance = TRUE, ntree = 100)
feature_importance <- importance(rf_model1)
importance_df <- data.frame(Feature = rownames(feature_importance), Importance = feature_importance[,1])
importance_df <- importance_df[order(-importance_df$Importance),]
importance_df
## Feature Importance
## lead_time lead_time 214.99
## no_of_special_requests no_of_special_requests 183.66
## arrival_month arrival_month 109.45
## avg_price_per_room avg_price_per_room 104.81
## market_segment_type market_segment_type 97.86
## no_of_week_nights no_of_week_nights 52.54
## no_of_weekend_nights no_of_weekend_nights 48.14
## required_car_parking_space required_car_parking_space 42.51
## no_of_adults no_of_adults 33.90
## type_of_meal_plan type_of_meal_plan 32.13
## room_type_reserved room_type_reserved 31.48
## arrival_year arrival_year 25.43
## no_of_children no_of_children 19.48
## repeated_guest repeated_guest 10.73
## no_of_previous_cancellations no_of_previous_cancellations 4.74
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # Flip the axes to make it easier to read
labs(title = "Feature Importance from Random Forest Model",
x = "Feature", y = "Importance") +
theme_minimal()
# Converting booking status to numeric
hotel_data$booking_status_binary <- as.numeric(hotel_data$booking_status_binary)
hotel_data$market_segment_type_num <- as.numeric(hotel_data$market_segment_type)
hotel_data$arrival_month_num <- as.numeric(hotel_data$arrival_month)
corr_data <- hotel_data[, c("booking_status_binary", "lead_time", "no_of_special_requests",
"avg_price_per_room", "market_segment_type_num", "no_of_week_nights",
"no_of_weekend_nights", "arrival_month_num")]
#correlation matrix
cor_matrix <- cor(corr_data)
print(cor_matrix)
## booking_status_binary lead_time no_of_special_requests
## booking_status_binary 1.0000 0.43854 -0.2531
## lead_time 0.4385 1.00000 -0.1016
## no_of_special_requests -0.2531 -0.10164 1.0000
## avg_price_per_room 0.1426 -0.06260 0.1844
## market_segment_type_num 0.1360 -0.00693 0.3085
## no_of_week_nights 0.0930 0.14965 0.0460
## no_of_weekend_nights 0.0616 0.04660 0.0606
## arrival_month_num -0.0112 0.13681 0.1106
## avg_price_per_room market_segment_type_num
## booking_status_binary 0.14257 0.13601
## lead_time -0.06260 -0.00693
## no_of_special_requests 0.18438 0.30848
## avg_price_per_room 1.00000 0.37565
## market_segment_type_num 0.37565 1.00000
## no_of_week_nights 0.02275 0.11295
## no_of_weekend_nights -0.00452 0.12907
## arrival_month_num 0.05442 -0.00631
## no_of_week_nights no_of_weekend_nights
## booking_status_binary 0.0930 0.06156
## lead_time 0.1497 0.04660
## no_of_special_requests 0.0460 0.06059
## avg_price_per_room 0.0228 -0.00452
## market_segment_type_num 0.1130 0.12907
## no_of_week_nights 1.0000 0.17958
## no_of_weekend_nights 0.1796 1.00000
## arrival_month_num 0.0374 -0.00989
## arrival_month_num
## booking_status_binary -0.01123
## lead_time 0.13681
## no_of_special_requests 0.11055
## avg_price_per_room 0.05442
## market_segment_type_num -0.00631
## no_of_week_nights 0.03738
## no_of_weekend_nights -0.00989
## arrival_month_num 1.00000
# Plotting the correlation matrix using corrplot
corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)
# Assuming you've already fitted the logistic regression model
lm_model1 <- lm(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room +
market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
data = hotel_data)
#summary(lm_model1)
gvif_values <- vif(lm_model1)
print(gvif_values)
## GVIF Df GVIF^(1/(2*Df))
## lead_time 1.25 1 1.12
## no_of_special_requests 1.21 1 1.10
## avg_price_per_room 1.44 1 1.20
## market_segment_type 1.70 4 1.07
## no_of_week_nights 1.08 1 1.04
## no_of_weekend_nights 1.06 1 1.03
## arrival_month 1.36 11 1.01
hotel_data$booking_status_binary <- as.factor(hotel_data$booking_status_binary)
# Fit Random Forest model for classification (binary target)
rf_model2 <- randomForest(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room +
market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
data = hotel_data, ntree = 100)
# Print the model summary to ensure it's treated as classification
print(rf_model2)
##
## Call:
## randomForest(formula = booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month, data = hotel_data, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 10.8%
## Confusion matrix:
## 0 1 class.error
## 0 22868 1522 0.0624
## 1 2386 9499 0.2008
pred_class_rf <- predict(rf_model2, newdata = hotel_data)
# Confusion Matrix (Make sure both predicted and actual values have the same levels)
conf_matrix_rf <- confusionMatrix(pred_class_rf, hotel_data$booking_status_binary)
conf_matrix_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23496 1592
## 1 894 10293
##
## Accuracy : 0.931
## 95% CI : (0.929, 0.934)
## No Information Rate : 0.672
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.842
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.963
## Specificity : 0.866
## Pos Pred Value : 0.937
## Neg Pred Value : 0.920
## Prevalence : 0.672
## Detection Rate : 0.648
## Detection Prevalence : 0.692
## Balanced Accuracy : 0.915
##
## 'Positive' Class : 0
##
# Print confusion matrix and the calculated metrics
conf_matrix_df <- as.data.frame(as.table(conf_matrix_rf))
conf_matrix_df
## Prediction Reference Freq
## 1 0 0 23496
## 2 1 0 894
## 3 0 1 1592
## 4 1 1 10293
colnames(conf_matrix_df) <- c("Actual", "Predicted", "Freq") # Rename columns for clarity
# Plot confusion matrix as heatmap using ggplot2
ggplot(conf_matrix_df, aes(x = Predicted, y = Actual, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "black", size = 6) +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Predicted", y = "Actual", title = "Confusion Matrix Heatmap") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Extracting Precision, Recall, and Accuracy from the confusion matrix
precision <- conf_matrix_rf$byClass['Pos Pred Value']
recall <- conf_matrix_rf$byClass['Sensitivity']
accuracy <- conf_matrix_rf$overall['Accuracy']
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.936
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.963
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.931
# Predict class probabilities for the binary outcome (probabilities for each class, we need the second column for class 1)
pred_probs_rf <- predict(rf_model2, newdata = hotel_data, type = "prob")[, 2]
# Calculate ROC-AUC for Random Forest (use the second column for class 1 probabilities)
roc_curve_rf <- roc(hotel_data$booking_status_binary, pred_probs_rf)
# Plot the ROC curve
plot(roc_curve_rf, main = "ROC Curve for Random Forest Model")
# Print AUC value
auc_value_rf <- auc(roc_curve_rf)
cat("AUC:", round(auc_value_rf, 4), "\n")
## AUC: 0.986
# Set up parallel processing (use all but 1 core)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
# Reduce number of trees for faster cross-validation
ntree_val <- 50
# Perform cross-validation with Random Forest using 5-fold cross-validation
cv_model <- train(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room +
market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
data = hotel_data,
method = "rf",
trControl = trainControl(method = "cv", number = 5), # 5-fold cross-validation
tuneGrid = data.frame(mtry = 3), # You can adjust mtry for tuning
ntree = ntree_val) # Set ntree to a smaller value for faster computation
# Print the results of cross-validation
print(cv_model)
## Random Forest
##
## 36275 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 29020, 29020, 29020, 29020, 29020
## Resampling results:
##
## Accuracy Kappa
## 0.865 0.679
##
## Tuning parameter 'mtry' was held constant at a value of 3
# Stop the parallel cluster
stopCluster(cl)
Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel bookings cancellation with a machine learning classification model. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 1100–1107). IEEE. https://ieeexplore.ieee.org/document/8260781
Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel booking cancellations to decrease uncertainty and increase revenue.Tourism & Management Studies, 13(2), 25–39. https://www.researchgate.net/publication/310504011_Predicting_Hotel_Booking_Cancellation_to_Decrease_Uncertainty_and_Increase_Revenue
Abeyrathne, C., & Bandara, H. M. N. D. (2023). Hotel booking cancellation prediction system using machine learning. Faculty of Engineering, University of Ruhuna. https://www.researchgate.net/publication/380515766_Hotel_Booking_Cancellation_Prediction_System_using_Machine_Learning